Nearly Optimal Exploration-Exploitation Decision Thresholds
نویسنده
چکیده
While in general trading off exploration and exploitation in reinforcement learning is hard, under some formulations relatively simple solutions exist. Optimal decision thresholds for the multi-armed bandit problem, one for the infinite horizon discounted reward case and one for the finite horizon undiscounted reward case are derived, which make the link between the reward horizon, uncertainty and the need for exploration explicit. From this result follow two practical approximate algorithms, which are illustrated experimentally.
منابع مشابه
Learning near-optimal search in a minimal explore/exploit task
How well do people search an environment for non-depleting resources of different quality, where it is necessary to switch between exploring for new resources and exploiting those already found? Employing a simple card selection task to study exploitation and exploration, we find that the total resources accrued, the number of switches between exploring and exploiting, and the number of trials ...
متن کاملHuman and Optimal Exploration and Exploitation in Bandit Problems
We consider a class of bandit problems in which a decision-maker must choose between a set of alternativeseach of which has a fixed but unknown rate of rewardto maximize their total number of rewards over a short sequence of trials. Solving these problems requires balancing the need to search for highly-rewarding alternatives with the need to capitalize on those alternatives already known to be...
متن کاملPsychological Models of Human and Optimal Performance in Bandit Problems Action Editor: Andrew Howes
In bandit problems, a decision-maker must choose between a set of alternatives, each of which has a fixed but unknown rate of reward, to maximize their total number of rewards over a sequence of trials. Performing well in these problems requires balancing the need to search for highly-rewarding alternatives, with the need to capitalize on those alternatives already known to be reasonably good. ...
متن کاملPsychological models of human and optimal performance in bandit problems
In bandit problems, a decision-maker must choose between a set of alternatives, each of which has a fixed but unknown rate of reward, to maximize their total number of rewards over a sequence of trials. Performing well in these problems requires balancing the need to search for highly-rewarding alternatives, with the need to capitalize on those alternatives already known to be reasonably good. ...
متن کاملAn Improved Bat Algorithm with Grey Wolf Optimizer for Solving Continuous Optimization Problems
Metaheuristic algorithms are used to solve NP-hard optimization problems. These algorithms have two main components, i.e. exploration and exploitation, and try to strike a balance between exploration and exploitation to achieve the best possible near-optimal solution. The bat algorithm is one of the metaheuristic algorithms with poor exploration and exploitation. In this paper, exploration and ...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2006